Introduction
The distributions of annual peak flows at USGS streamgages are used estimate flood frequency-magnitude characteristics at gaged sites by use of the log-Peason type-III distribution. This distribution is described by measures of the mean, variance, and skewness of the logarithm of annual peak flows. Estimates of high magnitude floods are particularly sensitive to the skewness of the distribution, although skewness estimates based on data from an individual streamgage are highly variable. Based on guidelines provided in Bulletin 17B and 17C, a regional estimate of flood skew is combined with an at-site (station skew) estimate to help stabilize the estimate of flood frequency characteristics.
A Bayesian generalized least squares (B-GLS) model is the standard approach for estimating the regional skew by regression on basin and climatic characteristics. This approach accounts for the length of record and spatial correlations among contemporaneous annual peaks in a network of streamgages by appropriately adjusting model parameters and uncertainty estimates. Despite the capabilities of the underlying model, identifying individual basin or climatic characteristics that are statistically associated with flood skew is problematic. Therefore, a constant is commonly used to estimate regional skew under the B-GLS model.
It may be the case, however, that subtle spatial variations in regional skew are caused by a combination of factors that are not linearly associated with available basin or climatic characteristics. Such effects may result in a persistent spatial pattern in the residuals of the skew estimates. To assess this possibility, a generalized additive model (GAM) is applied to the skew residuals from a B-GLS analysis of two, two-digit hydrologic regions, which includes US Great Lakes (04) and the Ohio River (05) basins. The GAM model provides a flexible basis of smoothing functions to accommodate irregular variations in covariates, such as spatial coordinates of gaged basin centroids. Linear components also are accommodated in a GAM model.
Computed station skew statistics from 368 streamgages in hydrologic unit regions 04 and 05 with 35 or more years of annual peak flow data were used in this analysis. Station skews were computed by use of methods described in Bulletin 17C. A preliminary analysis of the data describes the spatial distribution of gaged basin centroids, and analyzes the distribution of station skew values for outliers. A GAM model was fit to the skew values using the eastings and northing of basin centroids to provide an initial assessment of the potential utility of the GAM model. Within the GAM analysis, station-skews for individual streamgages will be weighted by their record length. This preliminary assessment was followed by repeated, random partitioning of the data set into training and testing subsets containing 80- and 20-percent of the full data set, respectively. The testing data set was used to determine whether the GAM model outperformed the constant estimated by use of the B-GLS model.
Initialize Environment
library(tidyverse)
[30m-- [1mAttaching packages[22m --------------------------------------- tidyverse 1.2.0 --[39m
[30m[32mv[30m [34mggplot2[30m 2.2.1 [32mv[30m [34mpurrr [30m 0.2.4
[32mv[30m [34mtibble [30m 1.3.4 [32mv[30m [34mdplyr [30m 0.7.4
[32mv[30m [34mtidyr [30m 0.7.2 [32mv[30m [34mstringr[30m 1.2.0
[32mv[30m [34mreadr [30m 1.1.1 [32mv[30m [34mforcats[30m 0.2.0[39m
[30m-- [1mConflicts[22m ------------------------------------------ tidyverse_conflicts() --
[31mx[30m [34mdplyr[30m::[32mfilter()[30m masks [34mstats[30m::filter()
[31mx[30m [34mdplyr[30m::[32mlag()[30m masks [34mstats[30m::lag()[39m
library(knitr)
Read in Skew Data for HUC Regions 04 and 05
df <- read_tsv(file = 'C:/Home/SW_Specialist/Skew0405/data/skew_data.txt') %>%
mutate(Station = paste0('0',USGS),
Resid_skew_sign = factor(sign(Residuals))) %>%
rename(Record_len = Pseudo_Length,
Station_skew = Station_Skew,
Residual_skew = Residuals)
Parsed with column specification:
cols(
Index = col_integer(),
USGS = col_integer(),
Pseudo_Length = col_integer(),
Station_Skew = col_double(),
log10_DA = col_double(),
Lat_Cntrd = col_double(),
Long_Cntrd = col_double(),
Residuals = col_double()
)
List Subset of Data and Summarize
kable(df[1:20,c('Station','Station_skew','Residual_skew','Lat_Cntrd','Long_Cntrd')],
caption = 'Sample of Skew Residual Data')
| 03336645 |
0.419 |
0.332 |
40.39385 |
-88.00935 |
| 03336900 |
0.000 |
-0.086 |
40.25089 |
-88.07321 |
| 03343400 |
-0.736 |
-0.822 |
39.92401 |
-88.14127 |
| 03344500 |
0.102 |
0.016 |
39.34818 |
-88.00714 |
| 03345500 |
-0.827 |
-0.913 |
39.51656 |
-88.13700 |
| 03346000 |
-0.304 |
-0.390 |
39.26853 |
-87.91700 |
| 03378000 |
-0.456 |
-0.542 |
38.53839 |
-87.94023 |
| 03378635 |
0.334 |
0.248 |
39.30351 |
-88.52165 |
| 03379500 |
-0.599 |
-0.686 |
38.99769 |
-88.48511 |
| 03380500 |
-0.037 |
-0.123 |
38.56331 |
-88.72489 |
| 03381500 |
0.239 |
0.153 |
38.64497 |
-88.44497 |
| 03382100 |
1.739 |
1.653 |
37.61436 |
-88.85263 |
| 03384450 |
-0.355 |
-0.441 |
37.53137 |
-88.52718 |
| 03385000 |
-0.231 |
-0.317 |
37.47848 |
-88.61897 |
| 03274650 |
-1.257 |
-1.343 |
40.04106 |
-85.10903 |
| 03275000 |
-0.360 |
-0.446 |
39.84952 |
-85.09757 |
| 03275600 |
-0.210 |
-0.297 |
39.86655 |
-84.84453 |
| 03276700 |
0.518 |
0.431 |
39.08198 |
-85.11773 |
| 03277000 |
0.076 |
-0.010 |
39.13191 |
-85.21422 |
| 03291780 |
-0.639 |
-0.725 |
38.93353 |
-85.28100 |
Plot Distribution of Streamgage Centrois
Note: symbols discretize residuals into positive and negative values
us_map %>%
filter(region %in% c('michigan', 'ohio', 'indiana', 'wisconsin', 'illinois',
'pennsylvania','new york', 'kentucky', 'west virginia',
'vermont')) %>%
ggplot( aes(x = long, y = lat, group = group)) +
geom_polygon(fill = 'tan', color = 'black') +
geom_polygon(data = huc2_0405, aes(x = long, y = lat, group = group),
fill = NA, color = 'blue') +
geom_point( data = df, aes( x = Long_Cntrd, y = Lat_Cntrd, group = NULL),
color = 'red') +
coord_map('conic', lat0 = 42)
Regions defined for each Polygons

Project Latitudes and Longitudes of Basin Centroids
library(rgdal)
library(sp)
df_prj <- df
coordinates(df_prj) <- c('Long_Cntrd', 'Lat_Cntrd')
class(df_prj)
[1] "SpatialPointsDataFrame"
attr(,"package")
[1] "sp"
proj4string(df_prj) <- "+proj=longlat +datum=NAD83"
# Same transform as
# EPSG:102003 USA_Contiguous_Albers_Equal_Area_Conic'
proj_sel <- 'EPSG:102004 USA_Contiguous_Lambert_Conformal_Conic'
# EPSG:102005 USA_Contiguous_Equidistant_Conic
df_prj <- spTransform(df_prj, CRS = CRS("+init=esri:102004"))
easting <- attributes(df_prj)$coords[,1]
northing <- attributes(df_prj)$coords[,2]
plot(easting, northing, pch = 16, col = 'blue')

east_std <- (easting - mean(easting ))/100000
nrth_std <- (northing - mean(northing))/100000
plot(east_std, nrth_std, pch = 16, col = 'green4',
main = paste('Regional Skew Streamgages',proj_sel),
xlab = 'Standardized Easting', ylab = 'Standardized Northing')

df %>%
ggplot( aes( x = Station_skew)) +
geom_histogram() +
geom_vline( xintercept = 0, color = 'red', linetype = 'dashed')

ndxOut <- which(df$Station_skew > 2)
# Remove outlier
df <- df[-ndxOut,]
df %>%
ggplot( aes( x = Station_skew)) +
geom_freqpoly( aes(y = ..density..), color = 'salmon', size = 1.2) +
geom_vline( xintercept = 0, color = 'red', linetype = 'dashed') +
stat_function(fun = dnorm, args = list(mean = mean(df$Station_skew),
sd = sd(df$Station_skew)),
color = 'blue') +
theme_few()

library(mgcv)
df$east <- east_std[-ndxOut]
df$nrth <- nrth_std[-ndxOut]
gam1 <- gam(Station_skew ~ s(east, nrth), weights = Record_len, data = df)
summary(gam1)
Family: gaussian
Link function: identity
Formula:
Station_skew ~ s(east, nrth)
Parametric coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.07062 0.02185 3.231 0.00135 **
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Approximate significance of smooth terms:
edf Ref.df F p-value
s(east,nrth) 24.42 27.72 3.847 7.74e-10 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
R-sq.(adj) = 0.215 Deviance explained = 26.8%
GCV = 12.151 Scale est. = 11.307 n = 366
plot(gam1)

df$gam_resid <- gam1$residuals
df %>%
ggplot( aes(x = gam_resid) ) +
geom_density(colour = 'red', size = 1.5) +
geom_density( aes(x = Residual_skew), color = 'blue', size = 1.5) +
theme_few()

us_map %>%
filter(region %in% c('michigan', 'ohio', 'indiana', 'wisconsin', 'illinois',
'pennsylvania','new york', 'kentucky', 'west virginia',
'vermont')) %>%
ggplot( aes(x = long, y = lat, group = group)) +
geom_polygon(fill = 'tan', color = 'black') +
coord_map('conic', lat0 = 42) +
# theme_void() + # This eliminates lat/lon marks
# theme_few() +
geom_point(data = df, aes( x=Long_Cntrd, y = Lat_Cntrd,
colour = gam_resid, group = NULL)) +
scale_colour_gradient2(low = "red" , mid = "white",
high = "blue" )

Skew Contour Map
A skew contour map is developed that is based on the gam1 model above.
plot_ly( x = c(xvec), y = c(yvec),
z = matrix(zvec,40,40, byrow = TRUE), type = 'contour',
contours = list(start = -0.5, end = 0.7, size = 0.2, showlabels = TRUE)) %>%
add_trace(type = 'scatter', mode = 'markers', x = df$east, y = df$nrth,
color = 'red', hovertext = paste(df$Station, df$Station_skew))
'scatter' objects don't have these attributes: 'z', 'contours'
Valid attributes include:
'type', 'visible', 'showlegend', 'legendgroup', 'opacity', 'name', 'uid', 'ids', 'customdata', 'hoverinfo', 'hoverlabel', 'stream', 'x', 'x0', 'dx', 'y', 'y0', 'dy', 'text', 'hovertext', 'mode', 'hoveron', 'line', 'connectgaps', 'cliponaxis', 'fill', 'fillcolor', 'marker', 'textposition', 'textfont', 'r', 't', 'error_y', 'error_x', 'xaxis', 'yaxis', 'xcalendar', 'ycalendar', 'idssrc', 'customdatasrc', 'hoverinfosrc', 'xsrc', 'ysrc', 'textsrc', 'hovertextsrc', 'textpositionsrc', 'rsrc', 'tsrc', 'key', 'set', 'frame', 'transforms', '_isNestedKey', '_isSimpleKey', '_isGraticule'
'scatter' objects don't have these attributes: 'z', 'contours'
Valid attributes include:
'type', 'visible', 'showlegend', 'legendgroup', 'opacity', 'name', 'uid', 'ids', 'customdata', 'hoverinfo', 'hoverlabel', 'stream', 'x', 'x0', 'dx', 'y', 'y0', 'dy', 'text', 'hovertext', 'mode', 'hoveron', 'line', 'connectgaps', 'cliponaxis', 'fill', 'fillcolor', 'marker', 'textposition', 'textfont', 'r', 't', 'error_y', 'error_x', 'xaxis', 'yaxis', 'xcalendar', 'ycalendar', 'idssrc', 'customdatasrc', 'hoverinfosrc', 'xsrc', 'ysrc', 'textsrc', 'hovertextsrc', 'textpositionsrc', 'rsrc', 'tsrc', 'key', 'set', 'frame', 'transforms', '_isNestedKey', '_isSimpleKey', '_isGraticule'
---
title: "Preliminary Analysis of Skew Residuals in HUCs 04 and 05"
author: Dave Holtschlag
output: html_notebook
---

## Introduction

The distributions of annual peak flows at USGS streamgages are used estimate flood frequency-magnitude characteristics at gaged sites by use of the log-Peason type-III distribution.  This distribution is described by measures of the mean, variance, and skewness of the logarithm of annual peak flows.  Estimates of high magnitude floods are particularly sensitive to the skewness of the distribution, although skewness estimates based on data from an individual streamgage are highly variable. Based on guidelines provided in Bulletin 17B and 17C, a regional estimate of flood skew is combined with an at-site (station skew) estimate to help stabilize the estimate of flood frequency characteristics.  

A Bayesian generalized least squares (B-GLS) model is the standard approach for estimating the regional skew by regression on basin and climatic characteristics.  This approach accounts for the length of record and spatial correlations among contemporaneous annual peaks in a network of streamgages by appropriately adjusting model parameters and uncertainty estimates. Despite the capabilities of the underlying model, identifying individual basin or climatic characteristics that are statistically associated with flood skew is problematic.  Therefore, a constant is commonly used to estimate regional skew under the B-GLS model.  

It may be the case, however, that subtle spatial variations in regional skew are caused by a combination of factors that are not linearly associated with available basin or climatic characteristics. Such effects may result in a persistent spatial pattern in the residuals of the skew estimates.  To assess this possibility, a generalized additive model (GAM) is applied to the skew residuals from a B-GLS analysis of two, two-digit hydrologic regions, which includes US Great Lakes (04) and the Ohio River (05) basins. The GAM model provides a flexible basis of smoothing functions to accommodate irregular variations in covariates, such as spatial coordinates of gaged basin centroids.  Linear components also are accommodated in a GAM model.

Computed station skew statistics from 368 streamgages in hydrologic unit regions 04 and 05 with 35 or more years of annual peak flow data were used in this analysis.  Station skews were computed by use of methods described in Bulletin 17C.  A preliminary analysis of the data describes the spatial distribution of gaged basin centroids, and analyzes the distribution of station skew values for outliers.  A GAM model was fit to the skew values using the eastings and northing of basin centroids to provide an initial assessment of the potential utility of the GAM model. Within the GAM analysis, station-skews for individual streamgages will be weighted by their record length. This preliminary assessment was followed by repeated, random partitioning of the data set into training and testing subsets containing 80- and 20-percent of the full data set, respectively.  The testing data set was used to determine whether the GAM model outperformed the constant estimated by use of the B-GLS model. 

## Initialize Environment

```{r setup}
library(tidyverse)
library(ggplot2)
library(ggthemes)
library(knitr)
```
***

## Read in Skew Data for HUC Regions 04 and 05

```{r read_data}
df <- read_tsv(file = 'C:/Home/SW_Specialist/Skew0405/data/skew_data.txt') %>% 
   mutate(Station = paste0('0',USGS),
          Resid_skew_sign = factor(sign(Residuals))) %>% 
   rename(Record_len      = Pseudo_Length,
          Station_skew    = Station_Skew,
          Residual_skew   = Residuals) 

```

## List Subset of Data and Summarize

```{r list_data, fig.cap = 'Table showing contents of skew data set'}

kable(df[1:20,c('Station','Station_skew','Residual_skew','Lat_Cntrd','Long_Cntrd')],
      caption = 'Sample of Skew Residual Data')

# Print summary of 
summary(df)

```

## Plot Distribution of Streamgage Centrois

Note: symbols discretize residuals into positive and negative values

```{r convert_coord}
library(rgdal)
library(sp)

path <- 'C:/Home/SW_Specialist/Skew0405/GIS/WBD_05_HU2_Shape/Shape' 
huc2_ohio_river  <- readOGR(dsn = path, layer = "WBDHU2")

path <- 'C:/Home/SW_Specialist/Skew0405/GIS/WBD_04_HU2_Shape/Shape'
huc2_great_lakes <- readOGR(dsn = path, layer = "WBDHU2") 

huc2_0405        <- rbind(huc2_great_lakes, huc2_ohio_river)

# How to extract dataframe from spatial polygon
spp <- SpatialPolygonsDataFrame(huc2_0405, data = as.data.frame(HUC2))
dfx <- data.frame(id = getSpPPolygonsIDSlots(huc2_0405))
row.names(dfx) <- getSpPPolygonsIDSlots(huc2_0405)
spdf <- SpatialPolygonsDataFrame(huc2_0405, data =dfx)

huc2_0405 %>% 
   ggplot( aes(x = long, y = lat, group = group)) +
   geom_polygon(fill = colors()[18], color = "black") +
   geom_point(df, aes(x = Long_Cntrd, y = Lat_Cntrd))


   geom_polygon(huc2_ohio_river, aes(x = long, y = lat, group = group),
                fill = colors()[15], color = 'red')


fill = colors()[18], 
+ 
   
   geom_polygon(data = states_ohio_river, 
                aes(x = long, y = lat, group = group),
                fill = NA, color = 'grey') +
   geom_point(data = skew_sites, 
              aes( x = LONG_CENT, y = LAT_CENT, group = NULL),
              color = 'red', 
              size  = 0.5) +
   coord_map() +
   theme_few() + xlab('Longitude') + ylab('Latitude') 




us_map <- map_data('state')

us_map %>% 
   filter(region %in% c('michigan', 'ohio', 'indiana', 'wisconsin', 'illinois',
                        'pennsylvania','new york', 'kentucky', 'west virginia',
                        'vermont')) %>% 
   ggplot( aes(x = long, y = lat, group = group)) + 
   geom_polygon(fill = 'tan', color = 'black') + 
   geom_polygon(data = huc2_0405, aes(x = long, y = lat, group = group),
                fill = NA, color = 'blue') +
   geom_point(  data = df, aes( x = Long_Cntrd, y = Lat_Cntrd, group = NULL), 
                color = 'red') + 
   coord_map('conic', lat0 = 42) +
   # theme_void() + # This eliminates lat/lon marks 
   theme_few() +
   geom_point(data = df, aes( x=Long_Cntrd, y = Lat_Cntrd, 
                              colour = Resid_skew_sign, group = NULL)) +
   scale_colour_manual( values = c('red', 'blue'))


```
***
## Project Latitudes and Longitudes of Basin Centroids

```{r proj_lat_long}
library(rgdal)
library(sp)

df_prj  <- df

coordinates(df_prj) <- c('Long_Cntrd', 'Lat_Cntrd')
class(df_prj)

proj4string(df_prj) <- "+proj=longlat +datum=NAD83"

# Same transform as 
#  EPSG:102003 USA_Contiguous_Albers_Equal_Area_Conic'
proj_sel <-  'EPSG:102004 USA_Contiguous_Lambert_Conformal_Conic'
#  EPSG:102005 USA_Contiguous_Equidistant_Conic
df_prj <- spTransform(df_prj, CRS = CRS("+init=esri:102004"))

easting  <- attributes(df_prj)$coords[,1]
northing <- attributes(df_prj)$coords[,2]


plot(easting, northing, pch = 16, col = 'blue')

east_std <- (easting  - mean(easting ))/100000
nrth_std <- (northing - mean(northing))/100000

plot(east_std, nrth_std, pch = 16, col = 'green4',
     main = paste('Regional Skew Streamgages',proj_sel),
     xlab = 'Standardized Easting', ylab = 'Standardized Northing')

```
```{r skew_dist}

df %>%
   ggplot( aes( x = Station_skew)) +
   geom_histogram() + 
   geom_vline( xintercept = 0, color = 'red', linetype = 'dashed')

ndxOut <- which(df$Station_skew > 2)

# Remove outlier

df <- df[-ndxOut,]


df %>%
   ggplot( aes( x = Station_skew)) +
   geom_freqpoly( aes(y = ..density..), color = 'salmon', size = 1.2) + 
   geom_vline( xintercept = 0, color = 'red', linetype = 'dashed') +
   stat_function(fun = dnorm, args = list(mean = mean(df$Station_skew),
                                          sd   =   sd(df$Station_skew)),
                 color = 'blue') +
   theme_few()

```


```{r gam_anal}

library(mgcv)

df$east <- east_std[-ndxOut]
df$nrth <- nrth_std[-ndxOut]


gam1 <- gam(Station_skew ~ s(east, nrth), weights = Record_len, data = df)

summary(gam1)

plot(gam1)

df$gam_resid <- gam1$residuals

```



```{r resid_plot}

df %>% 
   ggplot( aes(x = gam_resid) ) +
   geom_density(colour = 'red', size = 1.5) +
   geom_density( aes(x = Residual_skew), color = 'blue', size = 1.5) +
   theme_few() 

us_map %>% 
   filter(region %in% c('michigan', 'ohio', 'indiana', 'wisconsin', 'illinois',
                        'pennsylvania','new york', 'kentucky', 'west virginia',
                        'vermont')) %>% 
   ggplot( aes(x = long, y = lat, group = group)) + 
   geom_polygon(fill = 'tan', color = 'black') + 
   coord_map('conic', lat0 = 42) +
   # theme_void() + # This eliminates lat/lon marks 
   # theme_few() +
   geom_point(data = df, aes( x=Long_Cntrd, y = Lat_Cntrd, 
                              colour = gam_resid, group = NULL)) +
   scale_colour_gradient2(low  = "red" , mid = "white",
                          high = "blue" )


```

## Skew Contour Map

A skew contour map is developed that is based on the gam1 model above.  

```{r contour_info}

tmp  <- plot(gam1)


xvec <- tmp[[1]]$x
yvec <- tmp[[1]]$y
zvec <- tmp[[1]]$fit

plot_ly( x = c(xvec), y = c(yvec),
   z = matrix(zvec,40,40, byrow = TRUE), type = 'contour', 
   contours = list(start = -0.5, end = 0.7, size = 0.2, showlabels = TRUE)) %>% 
   add_trace(type = 'scatter', mode = 'markers', x = df$east, y = df$nrth, 
               color = 'red', hovertext = paste(df$Station, df$Station_skew)) 




```